Lecture 2: Logistic Regression (Part I)
\(N = 3154\) healthy young men aged 39–59 from the San Francisco area were assessed for their personality type. All were free from coronary heart disease at the start of the research. Eight and a half years later change in this situation was recorded.
wcgs data
'data.frame': 3154 obs. of 13 variables:
$ age : int 49 42 42 41 59 44 44 40 43 42 ...
$ height : int 73 70 69 68 70 72 72 71 72 70 ...
$ weight : int 150 160 160 152 150 204 164 150 190 175 ...
$ sdp : int 110 154 110 124 144 150 130 138 146 132 ...
$ dbp : int 76 84 78 78 86 90 84 60 76 90 ...
$ chol : int 225 177 181 132 255 182 155 140 149 325 ...
$ behave : Factor w/ 4 levels "A1","A2","B3",..: 2 2 3 4 3 4 4 2 3 2 ...
$ cigs : int 25 20 0 20 20 0 0 0 25 0 ...
$ dibep : Factor w/ 2 levels "A","B": 1 1 2 2 2 2 2 1 2 1 ...
$ chd : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
$ typechd: Factor w/ 4 levels "angina","infdeath",..: 3 3 3 3 2 3 3 3 3 3 ...
$ timechd: int 1664 3071 3071 3064 1885 3102 3074 3071 3064 1032 ...
$ arcus : Factor w/ 2 levels "absent","present": 1 2 1 1 2 1 1 1 1 2 ...
age: age in years
height: height in inches
weight: weight in pounds
sdp: systolic blood pressure in mm Hg
dbp: diastolic blood pressure in mm Hg
chol: fasting serum cholesterol in mm %
behave: behavior type, a factor with levels A1 A2 B3 B4
cigs: number of cigarettes smoked per day
dibep: behavior type, a factor with levels A (Aggressive) and B (Passive)
chd: coronary heart disease developed, a factor with levels no yes
typechd: type of coronary heart disease, a factor with levels angina infdeath none silent
timechd: time of CHD event or end of follow-up
arcus: arcus senilis, a factor with levels absent present

  chd          height          cigs
 no :2897   Min.   :60.00   Min.   : 0.0
 yes: 257   1st Qu.:68.00   1st Qu.: 0.0
            Median :70.00   Median : 0.0
            Mean   :69.78   Mean   :11.6
            3rd Qu.:72.00   3rd Qu.:20.0
            Max.   :78.00   Max.   :99.0
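A sketch of how output like the above can be produced; the wcgs data frame is assumed to come from the faraway package (a common home for this study's data):

```r
# Load the WCGS data; assumed here to come from the faraway package
library(faraway)

data(wcgs)                                   # 3154 obs. of 13 variables
str(wcgs)                                    # structure of each column

# Numeric summary of the three variables of interest
summary(wcgs[, c("chd", "height", "cigs")])
```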
Anything interesting stick out?
Pie chart of response 🤮
no yes
0.91851617 0.08148383
Bar charts tend to be more effective
Mosaic plot showing relationship between cigs and chd; not incredibly useful IMO unless both variables are categorical
Nonparametric density plot of height by chd status
Boxplot of cigs vs. chd status
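Base-R sketches of the exploratory plots described above (variable names taken from the str() output; dichotomizing cigs for the mosaic plot is an assumption, since mosaic plots need categorical inputs):

```r
# Bar chart of the binary response
barplot(table(wcgs$chd), xlab = "chd", ylab = "Count")

# Mosaic plot; collapsing cigs to smoker vs. non-smoker is an assumption
mosaicplot(table(smoker = wcgs$cigs > 0, chd = wcgs$chd))

# Nonparametric density of height by chd status
plot(density(wcgs$height[wcgs$chd == "no"]), main = "height by chd")
lines(density(wcgs$height[wcgs$chd == "yes"]), lty = 2)

# Boxplot of cigs by chd status
boxplot(cigs ~ chd, data = wcgs)
```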
Detour: decision trees are immensely useful tools for exploring new data sets
Useful resource: http://pages.stat.wisc.edu/~loh/treeprogs/guide/LohISI14.pdf
Ultimate reference: https://bgreenwell.github.io/treebook/ 😄
Standard CART-like decision tree
Conditional inference tree using only variables of interest
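One way to sketch both trees, assuming the rpart and partykit packages; dropping typechd and timechd from the CART-like tree is an assumption, made here because those columns encode the outcome:

```r
# Decision tree sketches for exploring the wcgs data
library(rpart)     # CART-like trees
library(partykit)  # conditional inference trees

# Standard CART-like decision tree (typechd/timechd dropped since
# they leak the outcome)
cart <- rpart(chd ~ . - typechd - timechd, data = wcgs)

# Conditional inference tree using only the variables of interest
cit <- ctree(chd ~ cigs + height, data = wcgs)
plot(cit)
```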
cigs is positively associated with the binary response chd
height is associated with chd
Recall that in linear regression we model the conditional mean response as a linear function in some fixed, but unknown parameters \(\boldsymbol{\beta}\):
\[ E\left(Y|\boldsymbol{x}\right) = \beta_0 + \beta_1x_1 + \dots + \beta_px_p = \boldsymbol{\beta}^\top\boldsymbol{x} \]
If \(Y\) is a binary random variable, then what is \(E\left(Y|\boldsymbol{x}\right)\)?
It turns out that \(E\left(Y|\boldsymbol{x}\right) = P\left(Y = 1|\boldsymbol{x}\right)\). The linear probability (LP) model assumes that \[ P\left(Y = 1|\boldsymbol{x}\right) = \beta_0 + \beta_1x_1 + \dots + \beta_px_p = \boldsymbol{\beta}^\top\boldsymbol{x} \]
Is this reasonable?
chd vs. cigs
Call:
lm(formula = y ~ cigs, data = wcgs)
Residuals:
Min 1Q Median 3Q Max
-0.25268 -0.09794 -0.05876 -0.05876 0.94124
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0587608 0.0062041 9.471 < 2e-16 ***
cigs 0.0019588 0.0003339 5.867 4.91e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2722 on 3152 degrees of freedom
Multiple R-squared: 0.0108, Adjusted R-squared: 0.01049
F-statistic: 34.42 on 1 and 3152 DF, p-value: 4.91e-09
Are the standard errors here appropriate? Why/why not?
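One concern: a binary response has variance \(p\left(\boldsymbol{x}\right)\left[1 - p\left(\boldsymbol{x}\right)\right]\), which changes with \(\boldsymbol{x}\), so the constant-variance assumption behind lm()'s standard errors fails. A sketch of one common fix, heteroskedasticity-consistent (sandwich) standard errors via the sandwich and lmtest packages; recoding chd to 0/1 with as.numeric(chd == "yes") is an assumption about how y was constructed above:

```r
# Sandwich (heteroskedasticity-consistent) standard errors for the
# linear probability model
library(sandwich)
library(lmtest)

fit_lp <- lm(as.numeric(chd == "yes") ~ cigs, data = wcgs)
coeftest(fit_lp, vcov = vcovHC(fit_lp, type = "HC3"))
```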
Assume \(Y \sim \mathrm{Bernoulli}\left(p\right)\), where \[ p = p\left(\boldsymbol{x}\right) = P\left(Y = 1|\boldsymbol{x}\right) \] and \[ \mathrm{logit}\left(p\right) = \log\left(\frac{p}{1-p}\right) = \boldsymbol{\beta}^\top\boldsymbol{x} \] In other words, logistic regression (LR) models the logit of the mean response as a linear function in \(\boldsymbol{\beta}\); we’ll refer to the term \(\eta = \boldsymbol{\beta}^\top\boldsymbol{x}\) as the linear predictor. Why does this make more sense?
Can always solve for \(p\) to get predictions on the raw probability scale: \[ p\left(\boldsymbol{x}\right) = \frac{\exp\left(\boldsymbol{\beta}^\top\boldsymbol{x}\right)}{1 + \exp\left(\boldsymbol{\beta}^\top\boldsymbol{x}\right)} \]
Note how the LR model is nonlinear on the probability scale \(p\)!
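Base R provides the logit and its inverse directly, so the back-transformation above can be checked numerically:

```r
# qlogis() is the logit log(p / (1 - p)); plogis() is its inverse,
# exp(eta) / (1 + exp(eta))
p <- 0.8
eta <- qlogis(p)              # log(0.8 / 0.2) = log(4)

plogis(eta)                   # inverse logit recovers p = 0.8
exp(eta) / (1 + exp(eta))     # same thing, written out explicitly
```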
glm() function instead of lm()
family argument! (see ?glm for details)
Call:
glm(formula = chd ~ cigs, family = binomial, data = wcgs)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9958 -0.4418 -0.3534 -0.3534 2.3684
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.742160 0.092111 -29.770 < 2e-16 ***
cigs 0.023220 0.004042 5.744 9.22e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1781.2 on 3153 degrees of freedom
Residual deviance: 1750.0 on 3152 degrees of freedom
AIC: 1754
Number of Fisher Scoring iterations: 5
Call:
glm(formula = chd ~ cigs + height, family = binomial, data = wcgs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0041 -0.4425 -0.3630 -0.3499 2.4357
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.50161 1.84186 -2.444 0.0145 *
cigs 0.02313 0.00404 5.724 1.04e-08 ***
height 0.02521 0.02633 0.957 0.3383
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1781.2 on 3153 degrees of freedom
Residual deviance: 1749.0 on 3151 degrees of freedom
AIC: 1755
Number of Fisher Scoring iterations: 5
The logit models the log odds of success (i.e., \(Y = 1|\boldsymbol{x}\)) \[ \log\left(\mathrm{odds}\right) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p \] Exponentiating, we get \[ \mathrm{odds} = \frac{p}{1-p} = \exp{\left(\beta_0\right)}\times\exp{\left(\beta_1x_1\right)}\times\exp{\left(\beta_2x_2\right)}\times\dots\times\exp{\left(\beta_px_p\right)} \]
Holding height constant, for every additional cigarette smoked per day the predicted log odds of developing chd increase by 0.023
Holding height constant, for every additional cigarette smoked per day the predicted odds of developing chd increase multiplicatively by 1.023
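These numbers can be recovered by exponentiating the fitted coefficients; a minimal sketch, assuming the two-predictor model fit shown earlier:

```r
# Odds ratios: exponentiate the coefficients of the logistic model
fit <- glm(chd ~ cigs + height, family = binomial, data = wcgs)

exp(coef(fit))             # e.g., exp(0.02313) is roughly 1.023 for cigs
exp(confint.default(fit))  # Wald confidence intervals on the odds scale
```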
Lots of different methods and packages:
- Marginal effects via the effects package
- Partial dependence (PD) plots and individual conditional expectation (ICE) plots via the pdp package
- Marginal effect and PD plots via the plotmo package
- And many, many more
Generally similar in shape when the model is additive in nature (i.e., no interaction effects)
The plotmo library is an “easy button” for quick and rough effect plots (other variables are held fixed at their median value) and supports a wide range of models
I generally prefer PD plots; see Greenwell (2017) for details
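A rough sketch of two of these options for the fitted model, assuming the pdp and plotmo packages are installed; treat argument details as assumptions and see each package's documentation:

```r
# Effect-plot sketches for the logistic regression fit from earlier
library(pdp)      # partial dependence plots
library(plotmo)   # quick-and-rough effect plots

fit <- glm(chd ~ cigs + height, family = binomial, data = wcgs)

# Other predictors held fixed at their median values
plotmo(fit)

# Partial dependence of chd on cigs; by default this is on the
# link (logit) scale for a glm
pd <- partial(fit, pred.var = "cigs", train = wcgs)
plotPartial(pd)
```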